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■ 2346527 

VIRTOTAL ACTOR WITH SET OF SPEAKER PROFILES 

Field of ^hP TnvAnh-ir.ri 

The present invention relates to generating model parameters 
for animating virtual actors. 

Bankgroiinrl of ^hA Tn-.r^TlMnn 
Computer programs exist which include providing a feature 
that draws an animated figure which appears to be speaking text 
from sources such las e-mail, word processing documents and 
internet pages. Typically, this animated object is called a 
"Talking Head", though the head is seldom very realistic in its 
appearance or movements. By using "target frames", which are the 
frames in the animation that correspond to the middle of a phoneme 
being synthesized,; fairly realistic animation can be produced but 
the transitions between phonemes are typically very jumpy and 
unnatural looking.' Most of the "Talking Heads" are therefore 
limited in their usefulness due to their lack of realism. 

There are thrbe standard approaches to creating these 
"Talking Heads". The simplest approach is to store a limited 
number (typically between five and about one hundred) of pictures 
that correspond to: various lip positions. These images are 
recalled and displayed at the appropriate time so that the 
"Talking Head" appears to be moving. 

The second method is to store "key frames" which describe the 
position of the lips and other objects for target frames. 
Typically, these frames correspond to the middle of each of the 
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phonemes. The frames between the key frames are derived from the 
key frames by some form of interpolation. This method creates 
smoother animation but the interpolation is often too simplistic 
to create realistic motions. For example, when a person says 

5 "pop" the lips do not follow a simple linear trajectory form of 
closed to open to closed. The key frame method would most likely 
just average these locations, resulting in a "Talking Head" that 
still lacks the ability to realistically imitate human motions. 
Moreover, for the methods that store key frames, there is a 

10 trade-off between smoothness of the animation and the memory or 
disk space that is required to create the "Talking Head" , The 
storage requirement increases with the number of pictures that are 
stored which means that smooth animation requires substantial 
storage space. Systems with restricted memory and disk space are 

15 forced to use a smaller number of frames which makes the animation 
very jumpy and unnatural. 

The third option is to write rules to describe the lip 
locations. These rules can produce natural motion if enough data 
such as the size and weight of the lips, placement and strengths 

20 of the face muscles and jaw positions can be obtained. The 

problem is that the rules can be very complicated and can take a 
long time to write. The rule based system can also be difficult 
to modify in order to model a new object. 

At least one disadvantage with all three approaches is the 

25 cost of achieving fidelity. The frame rate at which the animation 
is rendered directly impacts fidelity. In order to support fast 
frame rates, the system must either incur a storage cost- for 
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utilizing an extensive database or incur a development cost for 
manually creating; a set of rules complex enough to accurately 
predict the control parameters of the "Talking Head" . 

Another disadvantage of the traditional methods is that it is 
5 often difficult t(i> capture the personal idiosyncrasies of the 

specific person tfeat is to be modeled. If it is desirable to have 
a "Talking Head" that imitates a person, then it is desirable to 
have it speak, nod and move like that person. 

Hence, there lis a need for a method and apparatus for 
10 improving the fidelity of autonomously rendered models without 
increasing the coat of animating the model. 



Briftf DpscrinM'on r.f i--h^ n^-^^^jng p . 
A preferred ejnbodiment of the invention is now described, by 
15 way of example onl^, with reference to the accompanying drawings 
in which: 

FIG. 1 is a schematic representation of a neural network 
based parameter system for generating model parameters in 
accordance with the present invention; 

FIG; 2 is a schematic representation of a device for 
generating a parameter database in accordance with the present 
invention; 

FIG. 3 is a schematic representation of a device for training 
a neural network in accordance with the present invention; 
25 FIG. 4 is a schematic representation of an embodiment of a 

neural network in accordance with the present invention; 



20 
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FIG. 5 is a general purpose computer implementing the present 
invention. 

DArailf^d Df^arri pi- inn o f M-ifi PrRff:>rr6^d Kmbodiment . 
5 The present invention provides a neural network based 

virtual actor (Talking Head) in parallel with a text-to-speech 
system which is used to control virtual actors (e.g., three- 
dimensional graphical objects) . Hence, the present invention 
provides a method and apparatus for efficiently generating spatial 

10 parameters from linguistic information using a neural network, 
thus enabling the autonomous rendering of animated models with 
high fidelity and low cost. 

It is desired to improve the quality of computer software 
through increased comprehension of speech synthesizers through the 

15 addition of a virtual actor which is a model of a talking human. 
However, the primary problem with the standard approach is that 
the traditional talking head does not move like a real person. A 
major contribution to the unnaturalness is due to the difficulty 
in modeling co-articulatory effects. Most models neglect both the 

20 effects of previous or future jaw and mouth positions and the 

complicated interactions which leads to unrealistic models. This 
means that there is insufficient detail to allow a deaf person to 
lip-read from the talking head and it also means that there is 
only slight improvements in intelligibility for normal hearing 

25 people . 

In contrast to the traditional methods, the present invention 
is ideal for the task of creating a virtual actor. One advantage 
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Of using a neural! network instead of traditional methods is that " 
the process is highly automated. Once the training points are 
generated for the: desired person or object, the training of a 
neural network involves very little human interaction. in this 
5 process, the neural network is capable of capturing the 

idiosyncrasies of j the subject, including facial gestures, head 
movements, shoulder shrugs, waving, and other gestures of the 
hands and face. in addition, since the neural network is trained 
on data from a reil speaker, the motions will be realistic and 
10 natural. The neural network will also be able to learn the 

appropriate co-articulatory effects which will further enhance the 
naturalness of the virtual actor. 

Referring to |the drawings, FIG. 1 is a schematic 
representation of !a neural network based parameter system for 
generating model piarameters in accordance with the present 
invention. In a preferred embodiment, the neural network based 
parameter system 118 is used to generate a virtual actor (visually 
rendered model with speech 132) whose movements are correlated 
with synthetic spefech (neural network synthesized speech 144) . 
Text 1Q2 is used t6 drive the virtual actor and the output is an 
animated object th^t is synchronized with synthetic speech 
(visually rendered! model with speech 132). Text 102 is first 
converted to a linguistic representation of speech 106 by a text- 
to- linguistics module 104. The linguistic representation of speech 
106 describes speech segments which are preferably phoneme 
segments. The text, therefore, consists of a series of segment 
descriptions which I are composed of the following: a phoneme 
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identifier associated with a phoneme in current and adjacent 
segment descriptions; articulatory features associated with the 
phoneme in current and adjacent segment descriptions; locations 
of syllables, words and other syntactic and intonational 

5 boundaries; duration of time between phonemes, syllables, words 
and other syntactic and intonational boundaries; syllable 
strength information; descriptive information of a word type; and 
prosodic information. The prosodic information further includes 
at least one of the following: locations of word endings and a 

10 degree of dis juncture between words, locations of pitch accents 
and a form of those accents, locations of boundaries marked in a 
pitch contour and a form of those boundaries, a time separating 
marked prosodic events from other marked prosodic events, and a 
number of prosodic events of some type in a time period separating 

15 a prosodic event of another type and a frame for which coder 
parameters are generated. 

The linguistic representation of speech 106 is converted to 
neural network linguistic parameters 110 by a pre-processor 108. 
It should be noted that the neural network linguistic parameters 

20 differ from acoustic parameters (such as linear prediction 
coefficients, line spectral frequencies, spatial energy 
parameters, pitch, etc.). The pre-processor 108 generates the 
neural network linguistic parameters 110 which are numbers that 
are suitable for use by a neural network. A neural network module 

25 112 is used to" convert the neural network linguistic parameters 

110 into raw speech parameters 115 and raw spatial parameters 114. 
Since the output of the neural network module 112 only generates 
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values in the range of zero and one, a post -processor 116 is used ' 
to convert the raw spatial parameters 114 and raw speech 
parameters 115 int!o model parameters 120 and a parametric 
representation, of ispeech. 134, respectively. The post -processor 
5 116 scales the output of the neural network module 112 and 

converts the output into parameters that are suitable for driving 
a visual rendering; module 122 and parameters that are suitable for 
use by a waveform synthesizer 142. The waveform synthesizer 142 
is the synthesis portion of a vocoder which requires thirteen 
10 parameters to be provided every ten milliseconds: one describing 
the fundamental frequency of the, speech, one describing the 
frequency of the vpiced/unvoiced bands, one describing the total 
energy of the ten millisecond frame and ten line spectral 
frequency parameters . describing the frequency spectrum of the 
15 frame. The output : of the waveform synthesizer 142 is synthetic 
speech (neural network synthesized speech 144) . 

Similarly, the visual rendering module 122 requires that at 
least six three -diniensional coordinates are provided every 33.33 
milliseconds. These coordinates describe the corners of the lips, 
20 the position of the! top lip, the position of the bottom lip, the 
position of the chiin and a reference position located on the nose. 
Preferably, these s'ix points are supplemented with points 
describing positions of the eyebrows, forehead and cheeks for a 
total of seventeen breference points. 
25 In addition, it is preferable to generate the degree to which the 
teeth and the tonguie are visible. It is also preferable to 
generate parameters: describing the direction that the eyes appear 
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to be looking and the frequency and speed of blinking the eyes. 
Together these points describeparameters describing degree of lip 
opening, lip protrusion, jaw position, tongue position, eye 
positions, eyebrow positions and parameters describing head 

5 position. Preferably, these are three-dimensional coordinates 
which correspond to nodes on a wire-frame model. Alternatively, 
the three-dimensional coordinates could also describe the wires 
connecting the nodes of the wire-frame model. 

The visual rendering module 122 displays an animated image of 

10 a virtual actor (visually rendered model 124) . The visually 

rendered model 124 and the neural network synthesized speech 144 
are synchronized by the combining module 130 which generates a 
visually rendered model with speech 132 in order to convince the 
user that the speech is being generated by the virtual actor. 

15 In order for the neural network module 112 to be useful, it 

must first be trained. FIG. 2 is a schematic representation of a 
system for creating a representative parameter vector database 
that is used to train the neural network module 112. 

The parameter database 218 is generated by having a human 

20 speaker 202 generate speech. An audio and motion recording system 
2 04 is used to record the speech and corresponding motions of the 
human speaker 202. The recordings are typically generated by 
placing two video cameras at a ninety-degree angle with respect to 
each other with one video camera set up to record the side view of 

25 the human's head and the other video camera is set up to record 
the front view of the human's head. 
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The audio is ^typically recorded by an audio recording device 
such as a digital jaudio tape recorder through a microphone placed 
in front of the hiiman speaker 202. Alternatively, the audio can 
also be recorded by the video cameras or motion capture equipment 
can be used in place of or in addition to the video equipment. 

Expression input 220 can be given to the human speaker 202 
which is typically in the form of instructions such as g Please 
speak the following sentence in an angry tone of voiceg . The 
expression input 2-20 is converted to expression parameters 222 by 
the linguistic labeler 214. The linguistic labeler 214 also 
generates time aligned linguistic representation of speech 216 
which are linguistic labels for each speech segment. The speech 
segments are typically phoneme segments arid the linguistic labels 
are composed of the following: phoneme identifier associated with 
each phoneme; articulatory features associated with each phoneme; 
locations of syllables, words and other syntactic and . intonational 
boundaries; duration of time between phonemes, syllables, words 
and other syntactic and intonational boundaries; syllable strength 
information; descriptive information of a word type; and prosodic 
information. The prosodic information is additionally composed of 
at least one of the following: the locations of word endings and 
the degree of dis juncture between words, the locations of pitch 
accents and the fotm of those accents, the locations of boundaries 
marked in the pitch contour and the form of those boundaries, the 
25 time separating mairked prosodic events from other marked prosodic 
events, and the nuniber of prosodic events of some type in the time 
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period separating a prosodic event of another type and the frame 
for which the coder parameters are being generated. 

As stated above, the expression input 220 that was given to 
the human speaker 202 is converted to expression parameters 222. 

5 This conversion changes the expression input 220 (which is 

typically of the form g Please speak the following text in a happy 
tone of voiceg ) into a text label that is time aligned with the 
speech waveform (which for this example would read g happy® ) . 
The linguistic labeler 214 receives audio information 212 and 

10 expression input 220 and converts the audio information 212 in 
linguistic representation of speech 216 and converts the 
expression input 220 into predetermined labels (expression 
parameters 222) . This process can also be done manually by a 
person who is skilled in the art of labeling speech. 

15 Spatial parameters 210 are extracted from the motion 

information 206 by a spatial parameter extractor 208. If the 
motion of the human speaker 202 is recorded using a video 
recorder, then the spatial parameter extractor 208 is typically a 
tracking algorithm that locates key points of a video recording of 

20 the human face. White dots can be affixed or marked on the face 
at the key locations on the human speaker 2 02 to help the tracking 
algorithm locate these key points. Typically dots are placed on 
the lips, chin, eyebrows, forehead and cheeks with a total of 
approximately eight to fifty dots. The spatial parameter 

25 extractor 208 then combines the front and side views from the 

video recording and generates three-dimensional coordinates of the 
key points. 
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If the motion information 206 is generated by a motion 
capture device, then the markers or transmitters must be placed in 
the key locations I as described above. in this case, the spatial 
parameter extractor 208 will have to do the appropriate processing 
necessary to turn; the motion capture device output into three- 
dimensional coordinates. The spatial parameter extractor 208 
generates spatial I parameters 210 which can be the three- 
dimensional coordinates of the key points or, alternatively, 
parameters that are used to drive the visual rendering module 122. 

The spatial parameters 210, linguistic representation of 
speech 216 and the expression parameters 222 all contain time 
stamps which are used to synchronize these components with each 
other. The spatial; parameters 210, linguistic representation of 
speech 216 and the; expression parameters 222 are stored in the 
parameter database: 218 for later use. 

Turning to FIG. 3, once the parameter database 3 02 has been 
generated, the spatial parameters 308, corresponding linguistic 
representation of speech 306 and optionally the corresponding 
expression parameters 310 are retrieved by the parameter retrieval 
module 304. The parameter retrieval module 304 insures that the 
torrect spatial pa?raraeters 308, linguistic representation of 
speech 306 and optionally corresponding expression parameters 310 
are retrieved such i that they are correlated. Typically, this is 
done by insuring that the time stamps on each of these parameters 
are. the same, though the various parameters may have different 
sampling rates. This means that spatial parameters may have 
thirty parameter vectors corresponding to a second of motion (when 
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the motion is recorded at thirty frames per second) whereas the 
linguistic representation of speech may have one hundred parameter 
vectors corresponding to a second of speech (where the speech is 
recorded at one hundred frames per second) . Due to the 
5 potentially differing sampling rates, the parameter retrieval 

module 304 may have to interpolate between parameters in order to 
provide synchronized spatial parameters 308, linguistic 
representation of speech 306 and expression parameters 310. 
The spatial parameters 308 are converted to raw spatial 

10 parameters 322 by a target processor 312. The raw spatial 
parameters 322 are neural network target parameters and are 
typically scaled versions of parameters which are suitable for 
driving a visual rendering module 122. This means that any 
scaling or conversion between spatial parameters extracted from a 

15 human speaker 202 and parameters necessary to drive the visual 
rendering module 122 are performed by the target processor 312. 
After the spatial parameters 3 08 have been scaled and converted to 
parameters that are suitable for driving a visual rendering module 
122, the target processor 312 typically must scale the parameters 

20 into raw spatial parameters 322 which can be used to train a 

neural network. This typically means normalizing the vectors by 
scaling each element in the processed spatial parameter vector 
into a range whose minimum is zero and whose maximum is one. 

Similarly, the linguistic representation of speech 306 is 

25 converted by the pre-processor 314 into neural network linguistic 
' parameters 316 which are parameters that are suitable for input to 
a neural network module 320. In the preferred embodiment, the 
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neural network linguistic parameters contain four components, 
called streams - 

As shown in FIG. 4, the first stream is called the breaks 
input 425 and contains twenty- six parameters describing the 
following: the degree of dis juncture between the previous word, 
the current word aind the following word; 

the duration of the previous, current and following words; the 
distances to the previous, current and following word boundaries; 
the distance to the highest fundamental frequency of speech in the 
current syntactic 'phrase; and the value of the highest fundamental 
frequency of speech for the current syntactic phrase. . 

The second stiream is called the prosodic input 426 and 
contains forty-sevbn parameters describing the following: 
distances and values of phrase accents of the previous, current 
and future phrases'; tone of the previous and current phrases; 
distances and values of the two previous, the two current and the 
two future pitch accents; total number of high pitch accents since 
the beginning of t^e phrase; and total number of down- stepped 
pitch accents since the beginning of the phrase. 

The third stream is called the phonemic context 427 and 
contains sixty- three parameters describing the following: the 
current phoneme name, the prominence of the current word, the 
stress of the current syllable, and the category of the current 
word which is preferably a part-of -speech tag. 

The fourth stafeam is called the duration/distance input 428 
and contains three ihundred and sixty-eight parameters which 
describe the following: duration of the previous five phoneme 
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boundaries; distances to the previous five phoneme boundaries, sum 
of one divided by the frame number for all frames of each of the 
previous five phonemes, duration to the following five phoneme 
boundaries, distances to the following five phoneme boundaries, 
5 sum of one divided by the frame number for all frames of each of 
the following five phonemes, and a description of the syntactic 
boundaries . 

The expression parameters 310 are converted to neural network 
expression parameters 318 by the pre-processor 314. The neural 

10 network expression parameters can be contained by a separate fifth 
stream or alternatively combined with the prosodic input 426 in 
the second stream. In the preferred embodiment, the processed 
expression parameters 318 contain six parameters to describe the 
degree of the six basic expressions. The processed expression 

15 parameters 318 are preferably added to the stream containing the 
prosodic input 426. 

In detail, FIG. 4 displays the architecture of the neural 
network module 112 in accordance with the present invention. The 
neural network is composed of a layer of processing elements with 

20 a predetermined specified activation function and at least one of 
the following: another layer of processing elements with a 
predetermined specified activation function, a multiple layer of 
processing elements with predetermined specified activation 
functions, a rule-based module that generates output based on 

25 internal rules and input to the rule-based module, a statistical 
system that generates output based on input and an internal 
statistical function and a recurrent feedback mechanism. The 
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neural network is! hand modularized according to speech domain 
expertise. 

The neural network contains two phoneme -to- feature blocks 402 
and 403 which use' rules to convert the unique phoneme identifier 
contained in both; the phonemic context 427 and the 
duration/distance; input 428 to a set of acoustic features such as 
sonorant, obstruant and voiced. The neural network also contains 
a recurrent buffer 415 which is a module that contains a recurrent 
feedback mechanism. This mechanism stores the output parameters 
for a specified number of previously generated frames and feeds a 
non- linear sampling of the previous output parameters back to 
other modules which use the output of the recurrent feedback 
mechanism 415 as input. 

The square blocks in FIG. 4 404-414 and a 416-420 are modules 
which contain a single layer of perceptrons. The neural network 
input layer is composed of several single layer perceptron modules 
404-409 which have no connections between each other. All of the 
modules in the input layer feed into the first hidden layer 410. 
The output from the recurrent buffer 415 is processed by a layer 
of perceptron modules 416-418. The information from the output of 
the recurrent buffjer, the recurrent buffer layer of perceptron 
modules 416-418 an^d the output of the first hidden layer 410 is 
fed into a second bidden layer 411, 412 which in turn feeds the 
output layers 413 knd 414. 

In the preferred embodiment, the neural network module 112 
generates both spejsch and spatial parameters from the same input. 
The output containis two modules 419 and 420 to generate the speech 
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parameters and two modules 413 and 414 to generate the spatial 
parameters. The first speech output module 419 computes the 
voicing and energy parameters for the speech frame, whereas the 
second speech output module 420 generates the line spectral 

5 frequencies which describe the spectral contour of the speech 

frame. The first spatial parameter output module 413 generates a 
first set of reference points which are the frequency and degree 
of oscillations for spatial parameters, whereas the second spatial 
parameter output module 414 generates a second set of reference 

10 points which are the coordinates of the spatial parameters, which 
later get converted to three-dimensional coordinates that are used 
to drive the visual rendering module 122. The second set of 
reference points are different from the first set of reference 
points. The neural network output module 401 combines the output 

15 from each module in the output layer 413, 414, 419 and 420 and 

converts the information into parameters that are suitable for use 
by external devices. 

FIG. 4 also shows tapped delay line input modules 421-424 
which can be used to provide a non- linear context window of the 

20 input streams. In the preferred embodiment, a tapped delay line 
module 423 is used to provide phonemic context 427 of a total of 
thirty prior and future phonemic context vectors. These are not 
just the thirty adjacent phonemes that are presented but are the 
vectors from frames -40, -36, -32, -28, -24, -20, -16, -12, -8, - 

25 6, -4, -2, -1, 0, 1, 2, 3, 4, 5, 7, 9, 11, 15, 19, 23, 27, 31, 35, 
39 and 43, where the negative frame numbers indicate previous 
frames and positive frame numbers indicate the future frames. 
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Each frame has a J>referable duration of five milliseconds. 
Alternatively the | breaks input 425, prosodic input 426 and 
duration/distance ! input 428 can be processed by tapped delay line 
modules 421, 422 and 424, respectively. 

Since the nurtber of neurons is necessary in defining a neural 
network, the following table shows the details about each module: 
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The neural network is trained using the above procedure by 
using a back propagation of errors algorithm. A gradient descent 
technique or a Bayesian technique may alternatively be used to 
train the neural network. 

Alternatively, the linguistic representation of speech can be 
the output of a spieech recognizer 105. The speech recognizer 
would have speech 103 as input and would generate the linguistic 
parameters described above. The neural network would preferably 
be trained from more than one speaker and the pre-processor 108, 
neural network module 112 and post -processor 116 would all 
preferably receive: speaker profile information which would be 
generated by the sjpeech recognizer. The speaker profile is 
coupled to at least one of the text-to-linguistics module, the 
neural network based parameter system and the visual rendering 
module. Moreover, i the speaker profile has at least one of the 
following: visual I profiles and voice prof iles (the visual 
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profiles are associated with the voice profiles determined by a 
user identifier) . 

There are two methods of using the speaker profile for the 
neural network based parameter system 138. The speaker profile 

5 would preferably be used to select predetermined weights for the 
neural network but alternatively the speaker profile could be used 
to change predetermined inputs which would effect the output even 
though the weights would not change. In this alternative method, 
the neural network would have to be trained using the 

10 predetermined inputs which would remain at constant values for the 
training data that represents one human. The values would then be 
altered to a new predetermined constant value for a new human that 
is used to generate data for the parameter database 218 and 302. 
The pre-processor 108 and the post -processor 116 may also 

15 receive the speaker profile for the neural network based parameter 
system 138. The pre-processor 108 changes at least one of the 
predetermined neural network expression parameters 148 and the 
neural network linguistic parameters 110 based on the speaker 
profile. The post -processor 116 uses the speaker profile to 

20 modify the predetermined model parameters 120. In the preferred 
embodiment, the speaker profile is also used by the neural network 
module 112 to alter the parametric representation of speech 134 
which ultimately changes the characteristics of the neural network 
synthesized speech 144. This is preferably done by changing the 

25 weights of the neural network according to the speaker profile. 
The speaker profile for the visual rendering module 140 may 
contain information such as age, skin color, hair color, eye color 
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and dimensions of i facial features. Whenever these general setup * 
parameters are updated, the visual rendering module 122 will cause 
the visually rendered model 124 to change its appearance in a 
predetermined way J 

The text-to-linguistics module 104 preferably receives a 
speaker profile fpr text-to-linguistics module 134 which would 
include at least dne of language identifier, socio- linguistic 
information and physical information relevant to speech 
production. This jspeaker profile causes the text- to- linguistics 
module 104 to alter its output in a predetermined manner. For 
example, if the sj^eaker profile identified that the speaker was 
from England, theni-the linguistic representation of speech 106 
would contain British pronunciation of words in the text 102. 

In the preferred embodiment, the text -to- linguistics module 
104 also generates; expression parameters 146 which are preferably 
derived from the text 102. The expression parameters 146 describe 
expressions such a;s sadness, anger, joy, fear, disgust or 
surprise. Alternatively, these expression parameters 146 might be 
extracted from text 102 that is written with some markup language. 
As an example, the; text 102 might have the following sentence: 
g I am <emphasize>; very late < \emphasi2e>.^ The text-to- 
linguistics module! 104 would generate the expression parameter 146 
EMPHASIZED which wbuld be active during the duration of the phrase 
"very lateg . These expression parameters 146 effect the model 
parameters 120 used to generate the visually rendered model 124 
and one of the parametric representation of speech 134 and the 
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alternatively synthesized speech 128 used to generate the visually' 
rendered model with speech 132. 

An alternative embodiment uses a non-neural network based 
linguistics-to-speech module 126 to convert the linguistic 
5 representation of speech 106 into alternatively synthesized speech 
128. In this embodiment, the raw speech parameters 115, 
parametric representation of speech 134, waveform synthesizer 142 
and neural network synthesized speech 144 would not be present. 
The speaker profile for linguistics-to-speech module 126 would 

10 provide information which would change the characteristics of the 
alternatively synthesized speech 128, As an example, the speaker 
profile might contain information such as the gender of the 
speaker which would effect the pitch and spectral parameters of 
the alternatively synthesized speech. The linguistics-to-speech. 

15 module 126 may be combined with the text-to-linguistics module 104 
in at least one of a speech synthesis system, text-to-speech 
system and a dialog system. 

In the preferred embodiment the coding of the head position 
in terms of horizontal rotation, vertical rotation and tilt is 

20 coded as an amplitude and frequency of oscillation. The amplitude 
describes the limits or degree of movement and the frequency 
describes the speed of the movement. Tilt is defined as the 
rotation about the horizontal axis which passes through the front 
and back of the head and approximately intersects the bridge of . 

25 the nose. The combination of the three rotation parameters 
defines the direction that the head is facing. 
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Other non-art iculatory gestures such as blinking, smiling, 
shoulder shrugs, head nodding, winking and coughing are also 
generated by the rieural network-based parameter system 118 in the 
preferred embodiment. These events are preferably coded by using 
the method of generating the amplitude and frequency of the 
events, as descritied above. The speaker profile for neural 
network-based parameter system 138 and the speaker profile for the 
visual rendering miodule 140 preferably contains parameters 
describing which o;f the predetermined events are desired to be 
displayed in the visually rendered model 124 and the degree at 
which they should be present. Note that the device of the present 
invention may be implemented in a text-to-speech system, a speech 
synthesis system obr a dialog system . 

A general purpose computer implementing the present invention 
is shown in FIG. SL A computer monitor 510 having a screen 512 
displaying a virtual actor 514, a central processing unit 516 
having a microprocessor 518, a keyboard 520 and a data storage 
medium 522 (e.g., computer discs, floppy discs, etc.) are shown. 
The data storage mfedium 522 has stored thereon a set of 
instructions which;, when loaded into the computer causes the 
computer to act as' a neural network. When the computer is 
stimulated by linguistic parameters, the linguistic parameters are 
converted into raw! spatial parameters. The storage medium 522 
further has stored! thereon a set of speaker profiles 134, 136, 138 
and 140, including; at least visual profile data and yoice profile 
data. When the visual profile data and the voice profile data are 
loaded into the computer, they cause the computer to display the 
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virtual actor 514 having a visual appearance determined by the 
visual profile data and a voice determined by the voice profile 
data. 

Alternatively, the neural network based parameter system 118 

5 can be driven by an acoustic representation of speech, giving, as 
its output, spatial parameters which include model parameters 
(facial parameters and gestures of the face and body such as, 
eyebrow movement, head tilt, shoulder shrugs, hand waving, etc.) . 
These spatial parameters can be in addition to spatial parameters 

10 of lip movement- The acoustic representation of speech would 
preferably come from a speech recognition device but could come 
from a speech coder. The neural network is trained using the 
acoustic representation of speech as an input and generates the 
raw spatial parameters as described above. Instead of containing 

15 information on a select number of speakers, the parameter database 
would contain information recorded from a large predetermined 
number of human speakers . 

The present invention may be embodied in other specific forms 
without departing from its spirit or essential characteristics. 

20 The described embodiments are to be considered in all respects 
only as illustrative and not restrictive. The scope of the 
invention is, therefore, indicated by the appended claims rather 
than by the foregoing description. All changes which come within 
the meaning and range of equivalency of the claims are to be 

25 embraced within their scope. 
We claim: 
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1. A data storage medium having stored thereon a set of speaker 
profiles, including at least visual profile data and voice profile 
data, which, when loaded into a computer causes the computer to 
display a virtual actor having a visual appearance determined by 
the visual profile; data and a voice determined by the voice 
profile data, 

2, A computer which, when loaded with the set of speaker 
profiles of claim 1, displays a virtual actor, wherein the 
appearance of the virtual actor is selectable dependent on 
selection of a speaker profile from the set of speaker profiles. 
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